
feat: add cloudflare-metrics worker for graphql analytics export #28

Merged
zackpollard merged 109 commits into main from feat/cloudflare-metrics-exporter on Apr 22, 2026

Conversation

@zackpollard
Member

Adds a new Cloudflare Worker that runs on a 5-minute cron, queries the
Cloudflare GraphQL Analytics API for every resource type we currently
use (and several we don't yet), and pushes the data into VictoriaMetrics
via the existing InfluxDB line-protocol endpoint.

What's collected

20 datasets, each mapped to a cf_* measurement with snake_case tags and fields:

  • Workers: cf_workers_invocations, cf_workers_subrequests, cf_workers_overview
  • D1: cf_d1_queries, cf_d1_storage
  • R2: cf_r2_operations, cf_r2_storage
  • KV: cf_kv_operations, cf_kv_storage
  • Durable Objects: cf_durable_objects_invocations, cf_durable_objects_periodic, cf_durable_objects_storage, cf_durable_objects_sql_storage, cf_durable_objects_subrequests
  • Queues: cf_queue_operations, cf_queue_backlog
  • Hyperdrive: cf_hyperdrive_queries, cf_hyperdrive_pool
  • HTTP zones: cf_http_requests_overview
  • Pages Functions: cf_pages_functions_invocations

All points are tagged with account_id and written with the Cloudflare
bucket timestamp so historical backfills land in the right place.
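
For illustration, one exported point in line protocol might look roughly like this (the tag and field names beyond account_id are assumptions, not taken from the worker's actual output):

```ts
// Illustrative only: a single cf_* point in InfluxDB line protocol, with
// measurement, tags, fields, and the Cloudflare bucket timestamp (ns).
// Names other than the measurement and account_id are hypothetical.
const examplePoint =
  'cf_workers_invocations,account_id=abc123,script_name=version-api' +
  ' requests=1042,errors=0 1745280000000000000';
```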

Structure

Mirrors the existing version worker:

  • src/metrics.ts extends the shared pattern with floatField and a custom
    export timestamp so analytics values don't get truncated to integers.
  • src/graphql-client.ts — typed wrapper over the Cloudflare GraphQL API
    using a single JSON filter variable (works around the per-dataset
    filter input types).
  • src/datasets.ts — single registry describing every dataset's dimensions,
    aggregation blocks, and tag/field projection. Adding a new dataset is
    one entry (see the sketch after this list).
  • src/collector.ts — fetches each dataset, converts rows to Metric
    points, and records a self-observation per dataset (cloudflare_metrics_collector_dataset).
  • src/index.ts — fetch handler for /health and /collect (manual trigger)
    plus the scheduled() cron entry point.
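
A registry entry presumably looks something like this; the actual interface isn't shown in the PR, so the property names below are assumptions:

```ts
// Hypothetical shape of one datasets.ts registry entry. Only the concepts
// (dimensions, aggregation blocks, tag/field projection) come from the PR;
// the property names are illustrative.
interface DatasetSpec {
  dataset: string;                        // GraphQL node, e.g. 'd1QueriesAdaptiveGroups'
  measurement: string;                    // output measurement, e.g. 'cf_d1_queries_detail'
  dimensions: string[];                   // grouping keys, e.g. ['databaseId', 'datetimeMinute']
  aggregations: Record<string, string[]>; // e.g. { sum: ['readQueries'], avg: ['queryDurationMs'] }
  tags: Record<string, string>;           // row key to snake_case tag name
  fields: Record<string, string>;         // row key to snake_case field name
}
```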

Testing

  • Unit tests (32 tests, pnpm run test): line protocol formatting,
    query builder, variable builder, GraphQL client error paths, collector
    dimension/field projection, dataset registry invariants, HTTP handler.
  • Integration tests (pnpm run test:integration, gated on
    CLOUDFLARE_API_TOKEN + CLOUDFLARE_ACCOUNT_ID): every dataset query
    is executed against the real Cloudflare API and validated to match the
    parser's expected shape, plus a full collector run. All 20 datasets
    succeed; the first run against our production account emitted
    11,330 metric points.

Deployment

New Terraform module at deployment/modules/cloudflare/workers/cloudflare-metrics/:

  • api-token.tf — provisions a scoped read-only Cloudflare API token with
    "Account Analytics Read" permission via cloudflare_api_token.
  • worker.tf — worker, version, deployment, and a */5 * * * * cron trigger.

No custom domain — the worker is only triggered by cron.

@github-actions

github-actions Bot commented Apr 10, 2026

Preview Deployments (01b5219)

Worker                 Preview URL
github-approval-check  https://github-approval-check.pr-28.dev.immich.app
hello                  https://hello.pr-28.dev.immich.cloud
version                https://version.pr-28.dev.immich.cloud

@zackpollard
Member Author

Deployment status

CI has deployed the worker + dashboard to dev (PR-28 stage):

  • Worker: cloudflare-metrics-api-dev-pr-28 with */5 * * * * cron trigger
  • Grafana dashboard: Cloudflare Account Overview uploaded to the cloudflare-metrics (pr-28) folder in dev Grafana

The worker is currently a no-op on every cron tick because the analytics API token secret isn't wired up yet: TF_VAR_cloudflare_analytics_api_token is commented out in deployment/.env (see commit 6154677), so the worker defaults to an empty string and logs a cron_error{reason="missing_config"} self-metric. A minimal sketch of that guard is below.
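
```ts
// Sketch of the missing-config guard on each cron tick. The binding and
// helper names are assumptions; the self-metric labels are from above.
declare function emitSelfMetric(name: string, labels: Record<string, string>): Promise<void>;

async function runTick(env: { CLOUDFLARE_ANALYTICS_API_TOKEN: string }): Promise<void> {
  if (!env.CLOUDFLARE_ANALYTICS_API_TOKEN) {
    await emitSelfMetric('cron_error', { reason: 'missing_config' });
    return; // no-op until the secret is wired up
  }
  // ... run the full collection ...
}
```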

Follow-up to start data collection

To make the collector start emitting data to VictoriaMetrics/Grafana:

  1. Create a new Cloudflare API token in the dashboard with only the Account Analytics Read permission group, scoped to the target account.
  2. Store it in 1Password as CLOUDFLARE_METRICS_ANALYTICS_TOKEN in both the tf_dev and tf_prod vaults.
  3. Uncomment the line in deployment/.env:
    export TF_VAR_cloudflare_analytics_api_token="op://tf_$ENVIRONMENT/CLOUDFLARE_METRICS_ANALYTICS_TOKEN/password"
    
  4. Re-run CI (push any commit). The next cron tick within 5 minutes will start emitting cf_* metrics.

Why the token isn't generated via Terraform

The cleanest approach would be resource "cloudflare_api_token" { ... } with a hardcoded permission group UUID. I tried that (commit 65e0cf2) but the Terraform service account token doesn't have the User → API Tokens → Write permission required to call POST /user/tokens — returned 403. Both the data-source lookup and the resource creation hit the same endpoint. Getting this fully automated would need either (a) granting the Terraform service account the user-token-write permission, or (b) a dedicated bootstrap token stored out-of-band.

Integration test coverage

pnpm run test:integration is gated on CLOUDFLARE_API_TOKEN + CLOUDFLARE_ACCOUNT_ID and walks every dataset against the live API. Local run against the prod account: 20/20 datasets succeeded, 11,330 metric points emitted across a 1h window. Once the dev token is in place the same tests can run in CI.

@zackpollard zackpollard force-pushed the feat/cloudflare-metrics-exporter branch 2 times, most recently from ef0a947 to 7c7d1d7 on April 10, 2026 at 16:21
@zackpollard
Member Author

✅ Pipeline is now live end-to-end

Terraform now owns the API token

Following the devtools api-keys pattern, the analytics token is provisioned via cloudflare_api_token using a second aliased provider (cloudflare.bootstrap) that authenticates with the user-level var.cloudflare_api_token instead of the account-scoped token. No manual 1Password step required.

One wrinkle: Cloudflare provider v5 has a bug (#5045) where cloudflare_api_token.value is only populated immediately after creation and gets wiped on refresh. To work around it, there's a terraform_data.analytics_token_generation trigger — bumping its input forces the token to be destroyed and recreated on the next apply, which re-populates .value in the same plan/apply cycle, and the worker binding picks up the fresh value before the worker_version is committed.

The real bug

What took most of the debugging: my first version of the GraphQL client had private readonly fetchImpl: typeof fetch = fetch as a constructor default. In the Workers scheduled runtime that default captured undefined, so every fetchDataset call blew up with a synchronous TypeError before ever reaching the network. The worker was running through all 20 datasets in ~290ms, pushing 20 error metrics + summary to VictoriaMetrics, and that was it — no GraphQL calls at all.

Fix: look up globalThis.fetch lazily at the call site (9303520).
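
A minimal sketch of the bug and the fix (class and method names are assumptions):

```ts
class CloudflareGraphqlClient {
  // Broken: in the Workers scheduled runtime this field initializer
  // captured undefined, so every call threw a synchronous TypeError
  // before reaching the network.
  // private readonly fetchImpl: typeof fetch = fetch;

  async post(url: string, init: RequestInit): Promise<Response> {
    // Fix (9303520): resolve fetch lazily at the call site.
    return globalThis.fetch(url, init);
  }
}
```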

Verified working

Latest cron tick at 16:55:52 UTC:

  • subrequests: 22 (20 to api.cloudflare.com + 1 metric flush + 1 leftover diagnostic beacon, now removed)
  • cpuTimeUs: 98053, wallTime: 13.3s
  • api.cloudflare.com: 20 requests, all 200 OK, 617 KB of response body
  • VictoriaMetrics flush: 609 KB request body, 204 OK

That's the full 20-dataset collection making it end-to-end. The dev Grafana Cloudflare Account Overview dashboard in the cloudflare-metrics (pr-28) folder should start showing data within the next couple of cron cycles.

@zackpollard
Member Author

✅ Resource name enrichment (D1, queues, zones)

4822b0c + 9d8d8e8 deployed. At the start of each cron tick, the collector now fetches:

  • GET /accounts/{id}/d1/database → database_name tag on cf_d1_* metrics
  • GET /accounts/{id}/queues → queue_name tag on cf_queue_* metrics
  • GET /zones?account.id={id} → zone_name tag on cf_http_requests_overview metrics

Both the *_id / *_tag and the new *_name tags are emitted, so existing queries keep working and the human-readable name is a strict addition. The Grafana dashboard now groups by name with id as a secondary label.

If any of the three lookups fails, it's reported via a cloudflare_metrics_resource_lookup{resource,status,error} self-metric; collection still proceeds with whatever caches are populated.

Token permissions

The cloudflare_api_token now includes D1 Read and Queues Read alongside Account Analytics Read. Permission group UUIDs are looked up dynamically via data.cloudflare_api_token_permission_groups_list on the cloudflare.bootstrap provider — that token already has user-level permissions so the lookup works. Zones list doesn't need a dedicated permission.

Verified on 17:30:52 cron tick

  • subrequests: 24 → 20 GraphQL + 3 REST (D1/queues/zones) + 1 VictoriaMetrics flush
  • All 23 api.cloudflare.com calls returned 200
  • CPU 100ms, wall 15.1s

Dashboard legends should now show real names in dev Grafana within the next cycle.

@zackpollard
Copy link
Copy Markdown
Member Author

Zones: per-tag lookup for Pages projects

Root cause: GET /zones?account.id=X only returns the 6 registered zones. The 17 distinct zoneTags in httpRequestsOverviewAdaptiveGroups include ~15 Cloudflare Pages project zones (e.g. immich-app-archive.pages.dev), which aren't in the bulk list but ARE resolvable via GET /zones/{id}.

Fix

  • CloudflareRestClient.getZone(zoneId) added to fetch a single zone by id (returns null on 404 so we can tolerate deleted zones).
  • CloudflareMetricsCollector.resolveMissingZones(rows) runs after fetching the HTTP overview data: it extracts any zoneTag not already in the cache, fetches each in parallel with Promise.allSettled, and emits a cloudflare_metrics_resource_lookup{resource="zones_individual",status,requested,resolved,failed} self-metric (sketched after this list).
  • applyResourceTags now falls back to zone_name = zoneTag when the lookup failed entirely, so dashboard legends are never empty.
  • Token gained Zone Read in a second policy block scoped to com.cloudflare.api.account.zone.*. (First attempt nested the resource under the account scope and Cloudflare returned "Partial wildcard scope can only be followed by a match-all object expression" — fixed by using "*" directly.)
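
A sketch of the per-tag resolution under assumed client/cache shapes (only the names mentioned above come from the PR):

```ts
// Sketch of resolveMissingZones: resolve zoneTags not in the cache via
// individual GET /zones/{id} lookups. Client and cache shapes are assumptions.
async function resolveMissingZones(
  rows: { zoneTag: string }[],
  cache: Map<string, string>,
  getZone: (id: string) => Promise<{ name: string } | null>, // null on 404
): Promise<void> {
  const missing = [...new Set(rows.map((r) => r.zoneTag))].filter((t) => !cache.has(t));
  const results = await Promise.allSettled(
    missing.map(async (tag) => {
      const zone = await getZone(tag);
      if (zone) cache.set(tag, zone.name);
      return zone !== null;
    }),
  );
  const resolved = results.filter((r) => r.status === 'fulfilled' && r.value === true).length;
  // Emit cloudflare_metrics_resource_lookup{resource="zones_individual",
  //   requested=missing.length, resolved=resolved, failed=missing.length - resolved}
}
```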

Verified on the 17:55 tick

  • 41 calls to api.cloudflare.com — all 200 OK (was 23 × 200 + 20 × 403 before)
  • 20 GraphQL + 3 bulk REST + 17 individual zone lookups + 1 metric flush = 41
  • All cf_http_requests_overview rows now carry zone_name — the real name for Pages zones and bulk-listed account zones, or the zoneTag as a last resort

Dashboard legends should now show real hostnames like immich-app-archive.pages.dev within the next cron cycle.

@zackpollard
Member Author

✅ 1-minute granularity live, batched, and working

Final results on the 20:25:52 cron tick:

Metric               Before batching              After batching (cold)   After batching (warm)
Subrequests / tick   ~67 (threw at 50)            27                      9
Status               scriptThrewException         success                 success
CPU                  151 ms                       129 ms                  110 ms
Wall                 17.5 s                       6.8 s                   5.5 s
Flush body           0 KB (threw before flush)    1.58 MB                 1.6 MB

New datasets shipped

  • cf_d1_queries_detail — per-query counts/rows/duration (p50/p95/p99) from d1QueriesAdaptiveGroups, grouped by database_id/database_role/error
  • cf_queue_consumer — consumer concurrency_avg per queue from queueConsumerMetricsAdaptiveGroups
  • cf_http_requests_detail — per-zone detailed HTTP breakdown (count, bytes, visits, timing) from httpRequestsAdaptiveGroups grouped by country/method/status/cache/protocol (zone-scoped)
  • cf_workers_scheduled — cron-specific invocation stats from workersInvocationsScheduled, aggregated client-side into (script, cron, status, minute) buckets with invocations/cpu_time sum/avg/max (sketched below)
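
A sketch of that client-side bucketing, assuming the row shape:

```ts
// Sketch of the (script, cron, status, minute) bucketing described for
// cf_workers_scheduled. The row shape is an assumption.
interface ScheduledRow {
  scriptName: string;
  cron: string;
  status: string;
  datetimeMinute: string;
  cpuTimeUs: number;
}

function aggregateScheduled(rows: ScheduledRow[]) {
  const buckets = new Map<string, { invocations: number; cpuSum: number; cpuMax: number }>();
  for (const r of rows) {
    const key = [r.scriptName, r.cron, r.status, r.datetimeMinute].join('|');
    const b = buckets.get(key) ?? { invocations: 0, cpuSum: 0, cpuMax: 0 };
    b.invocations += 1;
    b.cpuSum += r.cpuTimeUs;
    b.cpuMax = Math.max(b.cpuMax, r.cpuTimeUs);
    buckets.set(key, b);
  }
  return buckets; // cpu_time avg = cpuSum / invocations per bucket
}
```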

firewallEventsAdaptiveGroups is excluded — it's gated on Business/Enterprise plans and returns an authorization error ("does not have access to the path") at both account and zone scopes on our plan. A comment in datasets.ts documents how to add it if we upgrade.

Granularity

Dropped from datetimeFiveMinutes to datetimeMinute across all 24 datasets. That's the finest grouping the Cloudflare Analytics API exposes — going below 1-minute would require the raw datetime sample stream, which is not aggregated and would explode cardinality. The collector window widened to 12 minutes (still overlapping the previous run by 7 minutes so one missed cron doesn't lose data).

Batching architecture

Three design changes in graphql-client.ts:

  1. fetchAccountBatch — all account-scope datasets that share a filter granularity are emitted as aliased fields under one viewer.accounts query. workersInvocationsScheduled rides along as a workers_scheduled alias in the datetime batch. Partial responses (one field erroring while others succeed) are preserved via executeAllowPartial + groupErrorsByAlias.

  2. fetchZoneBatch — all bulk-listed zones for a zone-scoped dataset get aliased zones(filter: {zoneTag: "..."}) blocks in a single query. Zone tags are validated against /[a-zA-Z0-9_-]+/ before being inlined, to prevent query injection (see the sketch after this list).

  3. globalZoneNameCache — a module-level map that persists Pages zone name lookups across Worker isolate invocations. Cold start costs ~20 /zones/{id} lookups (throttled); warm isolates skip them entirely.
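
A sketch of the zone-batch query construction from item 2, under an assumed builder shape:

```ts
// Sketch of fetchZoneBatch's query construction; the builder shape is an
// assumption. Each bulk-listed zone becomes an aliased zones(...) block,
// and tags are validated before inlining so they cannot inject syntax.
const ZONE_TAG_RE = /^[a-zA-Z0-9_-]+$/;

function buildZoneBatchQuery(zoneTags: string[], selection: string): string {
  const blocks = zoneTags.map((tag, i) => {
    if (!ZONE_TAG_RE.test(tag)) throw new Error(`unsafe zoneTag: ${tag}`);
    return `zone_${i}: zones(filter: { zoneTag: "${tag}" }) { ${selection} }`;
  });
  return `query { viewer { ${blocks.join(' ')} } }`;
}
```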

@zackpollard zackpollard force-pushed the feat/cloudflare-metrics-exporter branch 8 times, most recently from e0a6644 to 22b8202 on April 17, 2026 at 20:23
29999 was the wrong workaround — the real issue was the bundled
usage_model, not the value being out of range. 30000 works fine.
The curl was using -f, which fails the whole terraform apply if the
PATCH returns any HTTP error. The service-env settings are sticky once
set, so we don't actually need to re-PATCH on every deploy — make it
best-effort and just log the response.

Also forces a fresh isolate via the new version, unblocking the worker,
which has been hitting CPU exceedances on a long-lived isolate.
Captures all console.log and runtime traces in Cloudflare's
Observability dashboard so we can see what was happening during
CPU-exceeded incidents (logs are otherwise lost when the invocation
is killed). 100% sampling so we don't miss anything during debugging;
can lower the head_sampling_rate later if log volume becomes a concern.
Observability data shows Cloudflare drops/delays about 25% of our
cron triggers. When it recovers it fires the missed triggers as a
burst (observed up to 10 at once) all at the same wall clock. With
Date.now() as the query anchor, every catch-up invocation ran the
exact same query and the originally-scheduled minutes were lost.

Fix by anchoring the query window to controller.scheduledTime so each
catch-up invocation queries the window it was originally scheduled
for. Also bumps DEFAULT_WINDOW_MS from 3 → 5 min so each minute is
covered by ~6 consecutive ticks instead of 4 — with a 25% miss rate
the probability of all 6 ticks missing drops from 0.4% to 0.024%,
which should essentially eliminate the gaps.

VictoriaMetrics dedupes on (series, timestamp) so the extra overlap
is free on the storage side.
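
A sketch of the anchoring change (controller.scheduledTime is the Workers API; the collect signature is an assumption):

```ts
// Sketch: anchor the query window to the tick's original schedule so burst
// catch-up invocations each query their own minutes. ScheduledController and
// ExecutionContext come from @cloudflare/workers-types; collect is assumed.
const DEFAULT_WINDOW_MS = 5 * 60 * 1000;

declare function collect(windowStartMs: number, windowEndMs: number): Promise<void>;

export default {
  async scheduled(controller: ScheduledController, env: unknown, ctx: ExecutionContext) {
    const end = controller.scheduledTime; // originally scheduled time, not Date.now()
    ctx.waitUntil(collect(end - DEFAULT_WINDOW_MS, end));
  },
};
```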
Two bugs: (1) sum by (status) — the metric's label is http_response_status,
not status. Everything was collapsing into a single empty-key series.
(2) rate() on what's effectively a per-minute sum gauge, not a monotonic
counter — the raw sample values are independent per minute, so rate's
counter-reset handling isn't meaningful.

Changed to a direct per-minute sum grouped by http_response_status.
If still gappy after this, investigate whether the underlying samples
are actually missing in VM.
Our collector emits per-minute gauge values (sum of requests/errors/
subrequests within each minute), not monotonic counters. rate() on
these produces display artifacts — apparent gaps when the per-second
derivative can't be computed meaningfully between non-monotonic
samples. Switching to raw metric display shows the actual per-minute
counts directly.

Fixes reported gaps in subrequests-per-script for version-api-prod
between 07:17-07:21 where Cloudflare source data confirmed all
minutes had ~3000 subrequests and all cron ticks ran successfully.
All self-telemetry metrics are per-tick gauges emitted every minute.
rate() is wrong on these (not counters), and 5m/10m increase() windows
are unnecessarily wide for 1-minute data. Changed:
- rate(metric[5m]) → raw metric (rows, points, lookup counts, HTTP)
- increase(metric[5m]) → increase(metric[1m]) (errors, flush errors)
- increase(metric[10m]) → increase(metric[1m]) (cron exceptions)
Per-minute gauge data displayed as disconnected points looks like
random spikes. Set spanNulls=true and lineInterpolation=smooth on
all 9 timeseries panels so data points connect into a continuous
trend line.
Comprehensive fix across all 19 dashboards:

- 84 rate(cf_...[Xm]) → raw metric: our collector emits per-minute
  gauge values, not monotonic counters. rate() on these produced
  erratic spikes whenever traffic dropped between minutes (interpreted
  as counter resets). Raw metric display shows actual per-minute counts.

- 2 increase() windows narrowed from 5m/10m → 1m to match our
  1-minute cron interval.

- 132 timeseries panels styled: spanNulls=true, lineInterpolation=smooth,
  showPoints=never. Per-minute data points now connect into smooth
  trend lines instead of appearing as disconnected spikes.
…lo writes

Cloudflare routes cron triggers to multiple colos simultaneously. Each
colo queries the GraphQL analytics API and writes to VictoriaMetrics.
Due to eventual consistency, different colos can return different
aggregation counts for the same minute — one colo might see 3000
subrequests while another only sees 450 (partial data). VM's
last-write-wins causes the stored value to oscillate between the
competing writes, producing 6-7x swings on the dashboard.

Fix by wrapping all 128 timeseries metric queries with
max_over_time(metric[2m]). This takes the highest value for each
sub-series over a 2-minute window, ensuring the most complete colo's
data wins regardless of write order. For sum/count metrics, max picks
the most complete data. For max/p99 metrics, max is also correct.

Instant queries (stat panels using increase([24h])) are excluded since
they aggregate over long windows where the oscillation averages out.
Our metrics are per-minute gauges (value = count within that minute),
not monotonic counters. increase() computes (last - first) which is
meaningless for gauges — it could return near-zero for "total requests
in 24h" even when there were millions.

Changed:
- 65 stat panels: increase(metric[24h]) → sum_over_time(max_over_time(metric[1m])[24h:1m])
  Correctly totals all per-minute values over the window.
- 20 billing queries: increase(metric[1h]) → sum_over_time(metric[1h:1m])
  Correctly computes hourly cost from per-minute data.
- 6 self-telemetry queries: increase(metric[1m]) → raw metric or sum_over_time

Zero rate() or increase() remaining across all 19 dashboards. Every
cf_ metric query now uses the appropriate *_over_time wrapper:
- max_over_time: timeseries (multi-colo dedup)
- sum_over_time: totals (stat panels, billing, error tables)
- last_over_time: storage snapshots (carry forward)
Panels were labeled with per-second units (reqps, ops, Bps) from when
queries used rate(). Now that we display raw per-minute gauge values:

- 42 panels: reqps → short (per-minute count, not per-second)
- 9 panels: ops → short (same)
- 4 panels: Bps → bytes (per-minute byte total, not bytes/sec)
- 13 panel titles: removed "Rate" since we show counts not rates
- Guard division by zero in scheduled worker CPU avg (collector.ts)
- Skip NaN/Infinity values in InfluxDB line protocol serialization (sketched after this list)
- Remove dead applyResourceTags() function from emit.ts
- Use max() instead of sum() in alert PromQL for multi-colo safety
- Improve curl PATCH logging in worker.tf to surface failures
- Add test suites: emit, metric-providers (escaping + NaN), flush-state,
  resource-cache, scheduled handler window logic
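
A sketch of the NaN/Infinity guard, with an assumed serializer shape:

```ts
// Sketch: non-finite samples are dropped rather than serialized into the
// InfluxDB line protocol body. The function shape is an assumption.
function formatFields(fields: Record<string, number>): string {
  return Object.entries(fields)
    .filter(([, value]) => Number.isFinite(value)) // skip NaN/Infinity
    .map(([name, value]) => `${name}=${value}`)
    .join(',');
}
```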
…-deploy PATCH

the cloudflare_worker_version resource does support usage_model
(deprecated but functional); setting it on the version itself is the
durable fix. the prior post-deploy service-env PATCH was unreliable —
after commit 3eda62b the worker ran standard for ~45 min then reverted
to bundled on its own, causing 2+ hours of exceededCpu crons.
empirically verify what usage_model new cloudflare workers default to
at the version level. scheduled handler burns >50ms of cpu so if the
default is `bundled` (50ms cap) the cron will die with `exceededCpu`,
and if `standard` it completes with `outcome=ok`. no usage_model field
is set on cloudflare_worker_version in this commit — a follow-up will
add `usage_model = "standard"` to confirm the fix.
commit a proved new workers default to bundled (50ms cap) — first cron
on cpu-test-api-dev-pr-28 fired exceededCpu at cpu=50ms wall=51ms. add
usage_model = "standard" to cloudflare_worker_version and re-verify
via wrangler tail that cron outcome flips to ok with cpu > 50ms.
experiment confirmed empirically:
- new cloudflare workers deployed via terraform default to
  usage_model=bundled (50ms cpu cap). cpu-test first cron:
  exceededCpu at cpu=50ms wall=51ms.
- setting usage_model="standard" on cloudflare_worker_version
  lifts the cap. same handler, new version: cpu=2050ms wall=2105ms
  — 41x the bundled cap.
- the worker has been deleted from cloudflare via the api; postgres
  tf state schema services_cf_workers_cpu-test_dev_pr-28 is orphaned
  but harmless (module dir removed so terragrunt run-all won't
  discover it).
re-create cpu-test worker with usage_model=standard + limits.cpu_ms=30000
to see if it also reverts to bundled after some time. bump
cloudflare-metrics source to force a new version (the current one
reverted to bundled after ~2 hours of stable operation). will poll the
usage_model field to catch exactly when it flips.
previous deploy (ea725556) at 21:58 UTC reverted to bundled at 23:06 UTC
(~1h8m post-deploy). bump FORCE_NEW_VERSION to trigger a new version and
confirm both (a) redeploy fixes the 50ms cap immediately, and (b) the
~1 hour revert pattern repeats on the new version.
@zackpollard zackpollard force-pushed the feat/cloudflare-metrics-exporter branch from 5a61fc0 to 4fd3d95 on April 22, 2026 at 16:47
- prettier-format resource-cache.test.ts
- pass explicit undefined to normalizeTagValue to satisfy TS2554
- add --passWithNoTests to cpu-test vitest (experimental worker has no tests)
@zackpollard zackpollard force-pushed the feat/cloudflare-metrics-exporter branch from 4fd3d95 to b53dfc5 on April 22, 2026 at 16:55
- delete apps/cpu-test and its terraform module — experiment complete
- drop FORCE_NEW_VERSION redeploy trigger from cloudflare-metrics
@zackpollard zackpollard marked this pull request as ready for review on April 22, 2026 at 17:09
@zackpollard zackpollard merged commit ebeb6d6 into main Apr 22, 2026
10 checks passed
@zackpollard zackpollard deleted the feat/cloudflare-metrics-exporter branch on April 22, 2026 at 17:22